Drastic performance improvements for reads (#249) #342
Conversation
if order_objects did not change
use chunk size and index to skip to expected chunk
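The second commit title above describes the core trick: rather than scanning chunk by chunk, compute the target chunk's byte offset directly from the chunk size and index. A minimal sketch of that idea, assuming all chunks in a segment have the same byte size (`seek_to_chunk` and `data_start` are illustrative names, not npTDMS internals):

```python
def seek_to_chunk(f, data_start, chunk_size, chunk_index):
    """Seek straight to the start of chunk `chunk_index` instead of
    iterating over every preceding chunk.

    Hypothetical helper: assumes the segment's raw data begins at file
    offset `data_start` and every chunk occupies `chunk_size` bytes.
    """
    f.seek(data_start + chunk_index * chunk_size)
```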
Thanks for the contribution @johannesloibl, this looks great. I've left some review comments to be addressed.
Just to add some more context and make sure I'm understanding this right, can you provide an example of how you're reading a file? Are you reading a subset of the channels and for each channel, reading all data at once?
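For reference, a partial read of a single channel with npTDMS's streaming interface looks roughly like this (file, group, and channel names are made up):

```python
from nptdms import TdmsFile

# Open the file without loading all data into memory, then read one
# slice of one channel; only the requested range is read from disk.
with TdmsFile.open("measurements.tdms") as tdms_file:
    channel = tdms_file["group_name"]["channel_name"]
    data = channel[0:1000]
```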
Do you see a way to skip iterating through the data_objects and replace it by calculating the desired file offset?
It should be possible to compute them when reading the segment metadata, although that might add a bit more memory overhead, and this wouldn't be needed for the case where you read all data up front, so we'd probably want to disable that behaviour in that case.
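A hedged sketch of what precomputing those offsets during the metadata pass could look like; the names are illustrative, and it assumes contiguous (non-interleaved) channel data:

```python
def compute_channel_offsets(ordered_objects):
    """Map each channel path to its byte offset within one raw data
    chunk, built while the segment metadata is parsed.

    `ordered_objects` is assumed to be an iterable of objects exposing
    `path` and `data_size` attributes; these are illustrative names,
    not npTDMS's actual internal structures.
    """
    offsets = {}
    position = 0
    for obj in ordered_objects:
        offsets[obj.path] = position
        position += obj.data_size
    return offsets
```

As the comment notes, such a map costs memory proportional to the number of channels, so it would make sense to skip building it when all data is read up front anyway.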
Ready to merge from my point of view ;) Thanks for the quick support!
Thanks for the contribution!
Possible solution for #249.
See my comment here.
Partially reading a file with thousands of signals can take a huge amount of time (reading 100k signals in slices took ~7000 s).
I managed to improve performance substantially: at least 12x for 100k signals, and up to 25x for 10k signals.
What is still a problem: the read time still scales as O(n²) with the number of channels and chunks, because each channel has to be found by iterating through the list of data objects AND chunks.
If I read only 10k signals, the improvement factor is currently ~25x; for 100k signals it is only ~12x.
You can see that most of the time is now spent in the `_read_channel_data_chunk` and `_get_channel_number_values` functions. Once the channel is found, the search ends early (before, it iterated to the end of the list), but this means that with many signals, the time to find the desired signal still grows. We need to find a way to calculate where the channel position is in order to greatly speed this up, but I don't know the chunking internals well enough to do this.
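One possible direction for calculating the position directly: combine a per-channel offset map (built once from the segment metadata, as sketched above) with the chunk-skipping arithmetic, so locating any channel's data in any chunk becomes an O(1) computation. A sketch under those assumptions, with illustrative names only:

```python
def locate_channel(data_start, chunk_size, channel_offsets, path, chunk_index):
    """Return the absolute file offset of one channel's data within one
    chunk, without scanning data objects or chunks.

    `channel_offsets` maps channel path -> byte offset inside a chunk,
    and `chunk_size` is the byte size of a whole chunk. Hypothetical
    names, not npTDMS API; assumes fixed-size chunks and contiguous
    (non-interleaved) channel data.
    """
    return data_start + chunk_index * chunk_size + channel_offsets[path]
```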
Before: [profiling screenshot]
With my changes: [profiling screenshot]